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Abstract 

We study the problem of finding structured low-rank matrices using nuclear norm regular¬ 
ization where the structure is encoded by a linear map. In contrast to most known approaches 
for linearly structured rank minimization, we do not (a) use the full SVD; nor (b) resort to aug¬ 
mented Lagrangian techniques; nor (c) solve linear systems per iteration. Instead, we formulate 
the problem differently so that it is amenable to a generalized conditional gradient method, 
which results in a practical improvement with low per iteration computational cost. Numerical 
results show that our approach significantly outperforms state-of-the-art competitors in terms 
of running time, while effectively recovering low rank solutions in stochastic system realization 
and spectral compressed sensing problems. 


1 Introduction 

Many practical tasks involve finding models that are both simple and capable of explaining noisy 
observations. The model complexity is sometimes encoded by the rank of a parameter matrix, 
whereas physical and system level constraints could be encoded by a specific matrix structure. Thus, 
rank minimization subject to structural constraints has become important to many applications in 
machine learning, control theory, and signal processing [10, 22]. Applications include collaborative 
filtering [23], system identification and realization [19, 21], multi-task learning [28], among others. 

The focus of this paper is on problems where in addition to being low-rank, the parameter 
matrix must satisfy additional linear structure. Typically, this structure involves Hankel, Toeplitz, 
Sylvester, Hessenberg or circulant matrices [4, 11, 19]. The linear structure describes interdepen¬ 
dencies between the entries of the estimated matrix and helps substantially reduce the degrees of 
freedom. 

As a concrete example consider a linear time-invariant (LTI) system where we are estimating 
the parameters of an autoregressive moving-average (ARMA) model. The order of this LTI system, 
i.e., the dimension of the latent state space, is equal to the rank of a Hankel matrix constructed 
by the process covariance [20]. A system of lower order, which is easier to design and analyze, 
is usually more desirable. The problem of minimum order system approximation is essentially a 
structured matrix rank minimization problem. There are several other applications where such 
linear structure is of great importance—see e.g., [11] and references therein. Furthermore, since 
(enhanced) structured matrix completion also falls into the category of rank minimization problems. 
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the results in our paper can as well be applied to specific problems in spectral compressed sensing 
[6], natural language processing [1], computer vision [8] and medical imaging [24], 

Formally, we study the following (block) structured rank minimization problem: 

minj^ l\\A{y)-b\\l +fi-rank{Qm,n,j,k{y))- ( 1 ) 

Here, y = (yi, ...,yj^k-i) is an m x n{j + k — 1) matrix with yt G for t = 1,..., j + A: — 1, 

A : ig linear map, b G MP, Qm,n,j,k{y) G -g ^ structured matrix whose 

elements are linear functions of y^’s, and /x > 0 controls the regularization. Throughout this paper, 
we will use M = mj and N = nk to denote the number of rows and columns of Qm,n,j,k (y)- 

Problem (1) is in general NP-hard [21] due to the presence of the rank function. A popular 
approach to address this issue is to use the nuclear norm || • ||*, i.e., the sum of singular values, as 
a convex surrogate for matrix rank [22]. Doing so turns (1) into a convex optimization problem: 

rniny l\\A{y) - b\\j + y- ||Qm,n,i,fc(y)||*- (2) 

Such a relaxation has been combined with various convex optimization procedures in previous work, 
e.g., interior-point approaches [17, 18] and first-order alternating direction method of multipliers 
(ADMM) approaches [11]. However, such algorithms are computationally expensive. The cost per 
iteration of an interior-point method is no less than and that of typical proximal and 

ADMM style first-order methods in [11] is NM'^))] this high cost arises from each 

iteration requiring a full Singular Value Decomposition (SVD). The heavy computational cost of 
these methods prevents them from scaling to large problems. 

Contributions. In view of the efficiency and scalability limitations of current algorithms, the 
key contributions of our paper are as follows. 

• We formulate the structured rank minimization problem differently, so that we still find low- 
rank solutions consistent with the observations, but substantially more scalably. 

• We customize the generalized conditional gradient (GCG) approach of Zhang et al. [27] to 
our new formulation. Compared with previous first-order methods, the cost per iteration is 
0{MN) (linear in the data size), which is substantially lower than methods that require full 
SVDs. 

• Our approach maintains a convergence rate of O (^) and thus achieves an overall complexity of 

O , which is by far the lowest in terms of the dependence of M or V for general structured 

rank minimization problems. It also empirically proves to be a state-of-the-art method for (but 
clearly not limited to) stochastic system realization and spectral compressed sensing. 

We note that following a GCG scheme has another practical benefit: the rank of the intermediate 
solutions starts from a small value and then gradually increases, while the starting solutions ob¬ 
tained from existing hrst-order methods are always of high rank. Therefore, GCG is likely to find 
a low-rank solution faster, especially for large size problems. 

Related work. Liu and Vandenberghe [17] adopt an interior-point method on a reformulation 
of (2), where the nuclear norm is represented via a semidehnite program. The cost of each iteration 
in [17] is no less than O(M^V^). Ishteva et al. [15] propose a local optimization method to solve the 
weighted structured rank minimization problem, which still has complexity as high as O(V^Mr^) 
per iteration, where r is the rank. This high computational cost prevents [17] and [15] from handling 
large-scale problems. In another recent work, Fazel et al. [11] propose a framework to solve (2). 
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They derive several primal and dual reformulations for the problem, and propose corresponding 
first-order methods such as ADMM, proximal-point, and accelerated projected gradient. However, 
each iteration of these algorithms involves a full SVD of complexity N'^M)), making 

it hard to scale them to large problems. Signoretto et al. [25] reformulate the problem to avoid 
full SVDs by solving an equivalent nonconvex optimization problem via ADMM. However, their 
method requires subroutines to solve linear equations per iteration, which can be time-consuming 
for large problems. Besides, there is no guarantee that their method will converge to the global 
optimum. 

The conditional gradient (CG) (a.k.a. Frank-Wolfe) method was proposed by Frank and Wolfe 
[12] to solve constrained problems. At each iteration, it first solves a subproblem that minimizes 
a linearized objective over a compact constraint set and then moves toward the minimizer of the 
cost function. CG is efficient as long as the linearized subproblem is easy to solve. Due to its 
simplicity and scalability, CG has recently witnessed a great surge of interest in the machine 
learning and optimization community [16]. In another recent strand of work, CG was extended to 
certain regularized (non-smooth) problems as well [3, 13, 27]. In the following, we will show how a 
generalized CG method can be adapted to solve the structured matrix rank minimization problem. 

2 Problem Formulation and Approach 

In this section we reformulate the structured rank minimization problem in a way that enables 
us to apply the generalized conditional gradient method, which we subsequently show to be much 
more efficient than existing approaches, both theoretically and experimentally. Our starting point 
is that in most applications, we are interested in finding a “simple” model that is consistent with 
the observations, but the problem formulation itself, such as (2), is only an intermediate means, 
hence it need not be fixed. In fact, when formulating our problem we can and we should take the 
computational concerns into account. We will demonstrate this point first. 


2.1 Problem Reformulation 

The major computational difficulty in problem (2) comes from the linear transformation Qm,n,j,k{ ) 
inside the trace norm regularizer. To begin with, we introduce a new matrix variable X G "^fnjxnk 
and remove the linear transformation by introducing the following linear constraint 


Qm,n,j,k(,y) — X. 

For later use, we partition the matrix X into the block form 


(3) 


X := 


xn 

X2l 


Xl2 

X22 


^Ik 

X2k 


with Xj; G for f = l,...,j, l = 


* * * ^jk 


(4) 


We denote by x := vec(X) G the vector obtained by stacking the columns of X 

blockwise, and by X := mat(x) G the reverse operation. Since x and X are merely 

different re-orderings of the same object, we will use them interchangeably to refer to the same 
object. 
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We observe that any linear (or slightly more generally, affine) structure encoded by the linear 
transformation Qm,n,j,ki') translates to linear constraints on the elements of X (such as the sub¬ 
blocks in (4) satisfying say xi 2 = X 2 i)-, which can be represented as linear equations Bx = 0, with 
an appropriate matrix B that encodes the structure of Q. Similarly, the linear constraint in (3) 
that relates y and X, or equivalently x, can also be written as the linear constraint y = Cx for a 
suitable recovery matrix C. Details on constructing matrix B and C can be found in the appendix. 
Thus, we reformulate (2) into 


min \\\A{Cx) -b\\l + n\\X\\^ (5) 

3.g]Rmjfcxn 

s.t. Bx = 0. (6) 

The new formulation (5) is still computationally inconvenient due to the linear constraint (6). We 
resolve this difficulty by applying the penalty method, i.e., by placing the linear constraint into the 
objective function after composing with a penalty function such as the squared Frobenius norm; 

min l\\A{Cx) - b\\j + ^\\Bx\\j + y\\X\U. (7) 

Here A > 0 is a penalty parameter that controls the inexactness of the linear constraint. In essence, 
we turn (5) into an unconstrained problem by giving up on satisfying the linear constraint exactly. 
We argue that this is a worthwhile trade-off for (i) By letting A t oo and following a homotopy 
scheme the constraint can be satisfied asymptotically; (ii) If exactness of the linear constraint is 
truly desired, we could always post-process each iterate by projecting to the constraint manifold 
using Cproj (see appendix); (hi) As we will show shortly, the potential computational gains can be 
significant, enabling us to solve problems at a scale which is not achievable previously. Therefore, 
in the sequel we will focus on solving (7). After getting a solution for x, we recover the original 
variable y through the linear relation y = Cx. As shown in our empirical studies (see Section 3), the 
resulting solution Qm,n,j,k{y) indeed enjoys the desirable low-rank property even with a moderate 
penalty parameter A. We next present an efficient algorithm for solving (7). 

2.2 The Generalized Conditional Gradient Algorithm 

Observing that the first two terms in (7) are both continuously differentiable, we absorb them into 
a common term / and rewrite (7) in the more familiar compact form: 

min </>(A) :=/(A)+^||A||„ (8) 

which readily fits into the framework of the generalized conditional gradient (GCG) [3, 13, 27]. In 
short, at each iteration GGG successively linearizes the smooth function /, finds a descent direction 
by solving the (convex) subproblem 

Zfc G arg min (Z, V/(Afc_i)), (9) 

|| Z ||*<1 

and then takes the convex combination X^ = (1 — with a suitable step size % 

and scaling factor a^. Glearly, the efficiency of GCG heavily hinges on the efficacy of solving the 
subproblem (9). In our case, the minimal objective is simply the matrix spectral norm of — V/(Afc) 
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and the minimizer can be chosen as the outer product of the top singular vector pair. Both can be 
computed essentially in linear time 0{MN) using the Lanczos algorithm [7]. 

To further accelerate the algorithm, we adopt the local search idea in [27], which is based on 
the variational form of the trace norm [26]: 

||X||, = \ min{||C/||2 + \\V\\j : X = UV}. (10) 

The crucial observation is that (10) is separable and smooth in the factor matrices U and V, 
although not jointly convex. We alternate between the GCG algorithm and the following nonconvex 
auxiliary problem, trying to get the best of both ends: 

min V’(C/,V^), where V^(C/, G) =/([/G) + M(||[/||2 + ||y||2). ( 11 ) 

Since our smooth function / is quadratic, it is easy to carry out a line search strategy for finding 
an appropriate ak in the convex combination Xk+i = (1 — i]k)Xk + r]k{akZk) =: (1 — 'nk)Xk + OkZk, 
where 

9k = aigmin hk{6) (12) 

e>o 

is the minimizer of the function (on 9>0) 

hk{9) := /((I - Vk)Xk + eZk) + - Vk)\\Xk\U + (13) 

In fact, hk{9) upper bounds the objective function (f) at {1 — r]k)Xk + 6Zk. Indeed, using convexity, 

4>{{1 — rjk)Xk + OZk) = /((! — rik)Xk + OZ^) + 1^11(1 — 'r]k)Xk + 9Zk\\* 

< /((I ~ Vk)Xk + OZk) + 1^(1 — r/A:)||^fc||* + 

< /((I — r/fc)Xfc + 9Zk) + 1^(1 — ? 7 fc)||Xfc||* + ^6 (as ||^fc||* < 1) 

= hk{e). 

The reason to use the upper bound hk{0), instead of the true objective 4>{{1 — rjk)Xk + OZk), is to 
avoid evaluating the trace norm, which can be quite expensive. More generally, if / is not quadratic, 
we can use the quadratic upper bound suggested by the Taylor expansion. It is clear that 9k in 
(12) can be computed in closed-form. 

We summarize our procedure in Algorithm 1. Importantly, we note that the algorithm explicitly 
maintains a low-rank factorization X = UV throughout the iteration. In fact, we never need the 
product X, which is a crucial step in reducing the memory footage for large applications. The 
maintained low-rank factorization also allows us to more efficiently evaluate the gradient and its 
spectral norm, by carefully arranging the multiplication order. Finally, we remark that we need not 
wait until the auxiliary problem (11) is fully solved; we can abort this local procedure whenever 
the gained improvement does not match the devoted computation. For the convergence guarantee 
we establish in Theorem 1 below, only the descent property i^{UkVk) < '4){Uk-iVk-i) is needed. 
This requirement can be easily achieved by evaluating which, unlike the original objective cj), is 
computationally cheap. 

2.3 Convergence analysis 

Having presented the generalized conditional gradient algorithm for our structured rank minimiza¬ 
tion problem, we now analyze its convergence property. We need the following standard assumption. 
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Algorithm 1 Generalized Conditional Gradient for Structured Matrix Rank Minimization 
1: Initialize C/q, Vq! 

2: for /c = 1, 2,... do 

3: iuk,Vk) ^ top singular vector pair of — V/(?7fc_il4_i); 

4: set rjk <— 2/(/c + 1), and 9k by (13); 

5: Ginit ^ (\/l \^0kUk)'i Vinit ^ (\/l 'Hk^k—li 

6: {Uk, 14) ^ argmin4([/, V) using initializer (Cinit, Rinit); 

7: end for 


Assumption 1 There exists some norm || • || and some constant L > 0, such that for all A, B £ 
■^NxM ^ 

/((I - 7j)A + r/R) < /(A) + r?(R - A, V/(A)) + ^\\B - Af. 

Most standard loss functions, such as the quadratic loss we use in this paper, satisfy Assumption 

1 . 

We are ready to state the convergence property of Algorithm 1 in the following theorem. To 
make the paper self-contained, we also reproduce the proof in the appendix. 

Theorem 1 Let Assumption 1 hold, X be arbitrary, and Xk he the k-th iterate of Algorithm 1 
applied on the problem (7), then we have 

2C 

ct^iXk) - ct>{X) < —, (14) 

where C is some problem dependent absolute constant. 

Thus for any given accuracy e > 0, Algorithm 1 will output an e-approximate (in the sense of 
function value) solution in at most 0(l/e) steps. 

2.4 Comparison with existing approaches 

We briefly compare the efficiency of Algorithm 1 with the state-of-the-art approaches; more thor¬ 
ough experimental comparisons will be conducted in Section 3 below. The per-step complexity of 
our algorithm is dominated by the subproblem (9) which requires only the leading singular vector 
pair of the gradient. Using the Lanczos algorithm this costs 0{MN) arithmetic operations [16], 
which is significantly cheaper than the 0(min(M^A, A^M)) complexity of [11] (due to their need 
of full SVD). Other approaches such as [25] and [17] are even more costly. 

3 Experiments 

In this section, we present empirical results using our algorithms. Without loss of generality, we 
focus on two concrete structured rank minimization problems; (1) stochastic system realization 
(SSR); and (ii) 2-D spectral compressed sensing (SGS). Both problems involve minimizing the rank 
of two different structured matrices. For SSR, we compare different first-order methods to show 
the speedups offered by our algorithm. In the SGS problem, we show that our formulation can be 
generalized to more complicated linear structures and effectively recover unobserved signals. 
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3.1 Stochastic System Realization 

Model. The SSR problem aims to find a minimal order autoregressive moving-average (ARMA) 
model, given the observation of noisy system output [11]. As a discrete linear time-invariant (LTI) 
system, an AMRA process can be represented by the following state-space model 

st+i = Dst +Eut, zt = Fst + ut, t = l,2,...,T, (15) 

where st G W is the hidden state variable, ut G is driving white noise with covariance matrix 
G, and zt G is the system output that is observable at time t. It has been shown in [20] that the 
system order r equals the rank of the block-Hankel matrix (see appendix for definition) constructed 
by the exact process covariance y* = ^{ztzf^^j^), provided that the number of blocks per column, j, 
is larger than the actual system order. Determining the rank r is the key to the whole problem, 
after which, the parameters D,E,F,G can be computed easily [17, 20]. Therefore, finding a low 
order system is equivalent to minimizing the rank of the Hankel matrix above, while remaining 
consistent with the observations. 

Setup. The meaning of the following parameters can be seen in the text after E.q. (1). We 
follow the experimental setup of [11]. Here, m = n, p = nxn{j+k — l), while v = {vi,V 2 , .■.,Vj^k-i) 
denotes the empirical process covariance calculated as u* = ;^ Ylt=i ■, for 1 < f < A: and 0 

otherwise. Let w = (rci, ^ 2 , •••,'U^j+fc-i) be the observation matrix, where the Wi are all I’s for 
1 < i < A:, indicating the whole block of Vi is observed, and all O’s otherwise (for unobserved 
blocks). Finally, A{y) = vec{w o y), b = Yec{w o v), Q{y) = Hn^n,j,kiy), where o is the element-wise 
product and is Hn,n,j,k{-) the Hankel matrix (see Appendix for the corresponding B and G). 

Data generation. Each entry of the matrices D G E G F G is sampled 

from a Gaussian distribution N(0,1)- Then they are normalized to have unit nuclear norm. The 
initial state vector sq is drawn from N{0,lr) and the input white noise ut from N{0,ln)- The 
measurement noise is modeled by adding an term to the output zt, so the actual observation is 
Zt = Zt + where each entry of ^ G M” is a standard Gaussian noise, and a is the noise level. 
Throughout this experiment, we set T = 1000, a = 0.05, the maximum iteration limit as 100, and 
the stopping criterion as ||xfc+i — x^Hf < 10“^ or | ^ 10“^. The initial iterate is a 

matrix of all ones. 

Algorithms. We compare our approach with the state-of-the-art competitors, i.e., the first- 
order methods proposed in [11]. Other methods, such as those in [15, 17, 25] suffer heavier com¬ 
putation cost per iteration, and are thus omitted from comparison. Fazel et al. [11] aim to solve 
either the primal or dual form of problem (2), using primal ADMM (PADMM), a variant of primal 
ADMM (PADMM2), a variant of dual ADMM (DADMM2), and a dual proximal point algorithm 
(DPPA). As for solving (7), we implemented generalized conditional gradient (GGG) and its local 
search variant (GCGLS). We also implemented the accelerated projected gradient with singular 
value thresholding (APG-SVT) to solve (8) by adopting the FISTA [2] scheme. To fairly compare 
both lines of methods for different formulations, in each iteration we track their objective values, 
the squared loss ^\\A{Gx) — 6||p (or ^\\A{y) — 6||p), and the rank of the Hankel matrix Hm,n,j,k{y)- 
Since square loss measures how well the model fits the observations, and the Hankel matrix rank 
approximates the system order, comparison of these quantities obtained by different methods is 
meaningful. 

Result 1: Efficiency and Scalability. We compare the performance of different methods 
on two sizes of problems, and the result is shown in Figure 2. The most important observation is, 
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our approach GCGLS/GCG significantly outperform the remaining competitors in term of running 
time. It is easy to see from Figure 2(a) and 2(b) that both the objective value and square loss 
by GGGLS/GGG drop drastically within a few seconds and is at least one order of magnitude 
faster than the runner-up competitor (DPPA) to reach a stable stage. The rest of baseline methods 
cannot even approach the minimum values achieved by GGGLS/GGG within the iteration limit. 
Figure 2(d) and 2(e) show that such advantage is amplified as size increases, which is consistent 
with the theoretical finding. Then, not surprisingly, we observe that the competitors become even 
slower if the problem size continues growing. Hence, we only test the scalability of our approach 
on larger sized problems, with the running time reported in Figure 1. We can see that the running 
time of GCGLS grows linearly w.r.t. the size MN, again consistent with previous analysis. 

Result 2: Rank of solution. We also report the rank 
of Hn^n,j,k{y) versus the running time in Figure 2(c) and 2(f), 
where y = Cx if we solve (2) or y directly comes from the 
solution of (7). The rank is computed as the number of sin¬ 
gular values larger than 10“^. For the GGGLS/GGG, the iter¬ 
ate starts from a low rank estimation and then gradually ap¬ 
proaches the true one. However, for other competitors, the 
iterate first jumps to a full rank matrix and the rank of later 
iterate drops gradually. Given that the solution is intrinsically 
of low rank, GGGLS/GGG will probably find the desired one 
more efficiently. In view of this, the working memory of GGGLS 
is usually much smaller than the competitors, as it uses two low 
rank matrices U, V to represent but never materialize the solution until necessary. 



Figure 1: Scalability of GCGLS and 
GCG. The size (M, N) is labeled out. 


3.2 Spectral Compressed Sensing 

In this part we apply our formulation and algorithm to another application, spectral compressed 
sensing (SGS), a technique that has by now been widely used in digital signal processing applications 
[6, 9, 29]. We show in particular that our reformulation (7) can effectively and rapidly recover 
partially observed signals. 

Model. The problem of spectral compressed sensing aims to recover a frequency-sparse signal 
from a small number of observations. The 2-D signal 0 < k < ni,0 < ? < n 2 is supposed 

to be the superposition of r 2-D sinusoids of arbitrary frequencies, i.e. (in the DFT form) 


Y{k,l) = ^ 

i=l 




.j2-K fu\k ^^j2TT f2i\l 


2=1 


(16) 


where di is the amplitudes of the z-th sinusoid and {fxi, fyi) is its frequency. 

Inspired by the conventional matrix pencil method [14] for estimating the frequencies of sinu¬ 
soidal signals or complex sinusoidal (damped) signals, the authors in [6] propose to arrange the 
observed data into a 2-fold Hankel matrix whose rank is bounded above by r, and formulate the 
2-D spectral compressed sensing problem into a rank minimization problem with respect to the 
2-fold Hankel structure. This 2-fold structure is a also linear structure, as we explain in the ap¬ 
pendix. Given limited observations, this problem can be viewed as a matrix completion problem 
that recovers a low-rank matrix from partially observed entries while preserving the pre-defined 
linear structure. The trace norm heuristic for rank (•) is again used here, as it is proved by [5] to 











Figure 2: Stochastic System Realization problem with j = 21, k = 100, r = 10, /r = 1.5 for formulation (2) 
and /i = 0.1 for (7). The first row corresponds to the case M = 420, N = 2000, n = m = 20, . The second 
row corresponds to the case M = 840, N = 4000, n = m = 40. 


be an exact method for matrix completion provided that the number of observed entries satisfies 
the corresponding information theoretic bound. 

Setup. Given a partial observed signal Y with as the observation index set, we adopt the 
formulation (7) and thus aim to solve the following problem: 

“ ^f^(^)llF + ^\\Bx\\j + n\\X\\^ (17) 

where x = vec(X), mat(-) is the inverse of the vectorization operator on Y. In this context, as 
before, A = Pq, b = Pn{Y), where Pn{Y) only keeps the entries of Y in the index set and 

vanishes the others, Q{Y) = is the two-fold Hankel matrix, and corresponding B and 

( 2 ^\ 

C can be found in the appendix to encode = X . Further, the size of matrix here is 

M = kik2, N = {m — ki + l)(n2 — k2 + 1). 

Algorithm. We apply our generalized conditional gradient method with local search (GCGLS) 
to solve the spectral compressed sensing problem, using the reformulation discussed above. Follow¬ 
ing the experiment setup in [6], we generate a ground truth data matrix Y G j^ioixioi through a 
superposition of r = 6 2-D sinusoids, randomly reveal 20% of the entries, and add i.i.d Gaussian 
noise with amplitude signal-to-noise ratio 10. 

Result. The results on the SGS problem are shown in Figure 3. The generated true 2-D signal 
Y is shown in Figure 3(a) using the jet colormap. The 20% observed entries of Y are shown in 
Figure 3(b), where the white entries are unobserved. The signal recovered by our GCGLS algorithm 
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(a) True 2-D Sinosuidal Signal 


(b) Observed Entries 


(c) Recovered Signal 
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(d) Observed Signal on Column 1 


(e) Recovered Signal on Column 1 


Figure 3: Spectral Compressed Sensing problem with parameters ni = n 2 = 101, r = 6, solved with our 
GCGLS algorithm using fci = A :2 = 8,/i = 0.1. The 2-D signals in the first row are colored by the jet 
colormap. The second row shows the 1-D signal extracted from the first column of the data matrix. 


is shown in Figure 3(c). Comparing with the true signal in Figure 3(a), we can see that the result 
of our CGCLS algorithm is pretty close to the truth. To demonstrate the result more clearly, 
we extract a single column as a 1-D signals for further inspection. Figure 3(d) plots the original 
signal (blue line) as well as the observed ones (red dot), both from the first column of the 2-D 
signals. In 3(e), the recovered signal is represented by the red dashed dashed curve. It matches 
the original signal with significantly large portion, showing the success of our method in recovering 
partially observed 2-D signals from noise. Since the 2-fold structure used in this experiment is more 
complicated than that in the previous SSR task, this experiment further validates our algorithm 
on more complicated problems. 

4 Conclusion 

In this paper, we address the structured matrix rank minimization problem. We first formulate the 
problem differently, so that it is amenable to adapt the Generalized Conditional Gradient Method. 
By doing so, we are able to achieve the complexity 0{MN) per iteration with a convergence rate 
O (^). Then the overall complexity is by far the lowest compared to state-of-the-art methods for 
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the structured matrix rank minimization problem. Our empirical studies on stochastic system 
realization and spectral compressed sensing further confirm the efficiency of the algorithm and the 
effectiveness of our reformulation. 
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Appendix: Efficient Structured Matrix Rank Minimization 


Proof of Theorem 1 

The proof follows the line of that in [27]. 

Fix the “competitor” X. We first show that 


< (1 - rik)(t){Xk-i) + 'nk(t>{X) + 


Ckvl 

2 


where Ck := L ■ 


X\\^Zk — Xk-i 


2 

. Indeed, 


( 18 ) 


cPiXk) = f{Xk) + ^^\\Xk\U 

= min/((l - rjk)Xk-i + 9Zk) + /u(l - rjk)\\Xk-i\U + 

0^0 

< /((I - r]k)Xk-i + rik\\X\\^Zk) + /i(l - r]k)\\Xk-i\\* + nr]k\\X\\^ 

< f{Xk-i) + Vk{\\x\uzk - Xk-i,vf{Xk-i)) + ^ + m(i - vk)\\Xk-i\U + fkm\\x\U 

= cj){Xk-i) + vk{\\x\UZk - Xk-i,vf{Xk-i)) + ^ - fim\\Xk-i\U + All* 

Ckv'^ 

- M „ 4'{Xk-i)+ 'nk{Y - Xk-i,vf{Xk-i)) ^^ - At%IAfc-i||* + WfclAII* 

- M ,, +Vk{f(Y) - f{Xk-i))+ - Hrik\\Xk-i\U + Hrik\\X\\^ 

= (1 - r]k)(l){Xk-i) +Vk mm (/(T) + ^1 All*) + 

Ckvi 

= {l-rikmXk-i)+mHX) + ^. 


[( 12 )] 


[Assumption 1] 


[(9)] 


[convexity of /] 


Note that we only need the local search (line 6 of Algorithm 1) to satisfy the descent property 
'ip{UkVk) < 'ip{Uk-iVk-i), so that by induction i^{UkVk) < 'ip{UoVo) = Cq for some constant Cq. 
Thus Afcll = Afcl4|| is uniformly bounded, meaning that the term Ck in (18) can be bounded by 
a universal constant C (which depends on the competitor X that we fix throughout). 

Therefore, we have 


HXk) < (1 - vkMXk-i) + mHx) + (19) 

Let C = meix{C', (f>{Xi) — (p{X)). Then we show by induction that (14) holds. 

1. When k = l, (/>(Xi) - (j){X) < C, (14) holds. 

2. Suppose Theorem 1 holds for the k-th steps, i.e. 4>{Xk) — 4>{X) < we show that it also 
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holds for the [k + l)-th step. Indeed, by (19) and rjk+i = we have 


<)>(Xfc+i) - <)>(x) < (1 - iik+i){4>{Xk) - 4 >{x)) + 

^ k 2C 2C 

- k + 2 ' k + l ^ (A:+ 2)2 
_ 2C{k‘^ + 3k + l) 

~ JkTTj{k + W 

< 

- k + 2' 

This concludes the proof of Theorem 1 for all steps k. 

Linear Structured Matrices and the corresponding Matrix B and C 

General Linear Matrix Structures 

In general, linear matrix structures are defined [15] as: 

Uy 

Q{y) = Qo + '^ QkVk ( 20 ) 

k=l 

where Qk G 0 < k < Uy and y G is the given data. Let Qk,iii < ’mn) be the i’th element 

in vec{Qk)- 

We further assume that (1) Qo = 0, (2) each Qk is a (0,l)-matrix and (3) for Vi < mn, there 
exists at most one k such that Qk,i = 1. In other words, each element in the structured matrix Q{y) 
either equals to one element in y, or is 0. Most of the linear matrix structures, including block- 
Hankel and 2-fold Hankel used in our experiments, as well as Toeplitz, Sylvester and circulant, 
satisfy this assumption. 

We claim that for any structure Q : M”'*' —)• under this assumption, we can construct a 

“structure preserving matrix” B and a “recovery matrix” C such that for any X G 

Bvec{X) = 0 3y G s.t. X = Q{y) and Cvec{X) = y (21) 

or in other words, Bvec{X) = 0 <+ X G image(Q), where image(Q):= {Q{y)\y G B can be 

viewed as the Lagrangian of the structural constraint. 

The matrix B can be constructed in the following way. Let be the position of the jth 1 in 
vec{Qk)- Let \Qk\ be the number of I’s in Qk- The structure defined above requires that for any 
X G image(Q), each pair of {X,j , X.j-t-i) must be equal. Since there are totally T = “ 1) 

such pairs, B can be constructed as a T x mn sparse matrix by only assigning B^^j = 1, ^j+i = — 1 
for the tth pair of X^j = X^j+i constraint. In case we need to enforce some elements of X to be 
zero, we may add more rows to B with only one 1 per row at the position of the focused element. 

The matrix C can be constructed as a Uy x mn sparse matrix by assigning = l/\Qk\ and 
leaving other entries 0. Note that this C can be applied to arbitary X G orthogonal 

projection onto image(Q), i.e. 

Q(Cvec(X)) = arg min \\X — X||p 

Xeimage(Q) 
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Thus we call this C the projection matrix Cproj- One may refer to the Appendix of [15] for the 
proof. C can be also constructed in other ways to satisfy (21), for instance, a sparser C can be 
constructed by assigning only = 1 for 1 < /c < Uy. It’s easy to verify that the sparser one Csp 
is also an inverse operator of vec(Q(-)) 

Example: Hankel Matrix 

In the following examples, we always use Ik to denote the identity matrix of size k xk and 
to denote a zero matrix of size k x j. 

For a Hankel matrix of data y G parameterized by j and k: 

yi y 2 • • • yk 

y2 ys • • • Vk+i 

yj ^i+i ‘ ‘ ‘ yj+k—i 

The Hankel structure preserving matrix B G ^(■^“^^(^“^^^•^^(after rearranging the order or rows) 
is 

P N 0 0 ••• 0 O' 

0 P N 0 0 0 

: : : : : : : 

0 0 0 0 P N 

where P = [Oj_ip,N = [—/j_i, 0 = Oj-ij- Obviously P,N,0 G 

For the recovery matrix C G we show a toy example using parameters j = 2,k = 3. 

The projection Cproj and the sparser Cgp are 

'1 0 0 0 0 O' 

_ 0 0.5 0.5 0 0 0 

0 0 0 0.5 0.5 0 

0 0 0 0 0 1_ 

Example: Block-Hankel Matrix 

For the block-Hankel matrix used in the stochastic system realization experiment 

2/1 2/2 • • • yk 

2/2 2/3 • • • yk+i 

yj 2/j+i ■ ■ ■ Vj+k—i 

where each yi,... ,yj^k-i is a m x n data matrix. If we define vec(-) blockwise, we can write 
the matrix B G wnU-^){k-i)xmjk in the same form as (23) where P = N = 

0 = P,N, 0 G 

The matrix C G constructed from (24) by replacing each element a with 

a block aim- 

Example: Two-Fold Hankel Matrix 




1 0 0 0 0 0 
0 1 0 0 0 0 
0 0 0 1 0 0 
0 0 0 0 0 1 
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Here we use to denote the 2-fold Hankel structure. has M = kik 2 rows and N = 

(ni — ki + l)(n 2 — A :2 + 1) columns. 

Here H is a matrix with /ci(/c 2 — l)(n 2 — k 2 ){ni — A:i -|- 1) -|- n 2 {ki — l)(ni — ki) rows and MN 
columns: 

Hi 0 • • • 0 
0 Hi • • • 0 

q _ _ 0 _ _ •_ Hi 

B2 

such that each Hi G ]^fci(*: 2 -i)(n 2 -fc 2 )xfcifc 2 (n 2 -fc 2 +i) preserves the micro Hankel structure of ki blocks 
in one “block-wise column” of X and H 2 G ^n 2 {ki-i){m-kr)xMN preserves the global block-Hankel 
structure. Both Hi and H 2 as well as the recovery matrix Cproj are constructed using the steps 
mentioned above. 
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